Chapter 1


INTRODUCTION

Statement of the Problem

Large computer systems have historically been centered around a
single, centrally located computer system. Users have been serviced
through the central system regardless of their local needs or
geographical location. More recently, users have been effective in
bringing about a recognition that their problems might be better
solved with distributed computer power because they, as a group, are
distributed. Although there is still an argument for servicing
distributed users with a single central computer system, there is now
a trend toward systems in which the information processing and
storage functions are distributed among several computers. With this
distribution there has been increased interest in the use of
distributed database systems. The reason for this interest lies in
the fact that distributed database systems provide a solution to some
very real problems of a geographically distributed organization.
These organizations must maintain a unified information-sharing and
processing system.

The computing industry is becoming more and more cognizant of the
vital corporate resource--data. A problem shared by both centralized
and distributed database systems is that of keeping track of what is
in the database and where it might be located. The best and most
recently developed tool for assisting a corporation in the location
and use of company data is the data dictionary.

The data dictionary is a facility which contains defining information
about the data held in an information database. Data dictionaries
have taken many forms, from single card files to large, complex
computer stored and accessed files. The last few years have seen the
growth of data dictionaries used to assist in the management and
control of large database systems running on centralized systems.
Inasmuch as data dictionaries in this environment become very large,
a data dictionary is a database about the database [1,2,3]. As
computer systems and databases are distributed, it will become
essential to understand the use of the data dictionary in the
distributed environment.

The purpose of this thesis is to develop a data dictionary
description for use in a distributed integrated database system.

Problem Justification

At the present time there are approximately seven marketed data
dictionary systems. Only one of these systems is termed fully
integrated. The other six are referred to as free-standing. According
to Lefkovits [5], a free-standing dictionary system is unknown to the
operating system, language processors, or the Database Management
System (DBMS). On the other hand, the DBMS, language processors, and
operating system are fully aware of the existence of an integrated
data dictionary system. In addition, there is no evidence in the
literature of a data dictionary developed for a distributed system,
especially an integrated data dictionary.

Problem Limitations

Distribution of a database takes several forms; this thesis is
limited to horizontal distribution. In a horizontal distribution the
computers may be physically different and of unlike capacity and
power, but they are logically equal with regard to a hierarchical
structure. They interconnect on the same level. In a vertical
distribution, the computers are hierarchically distributed, with the
smaller computers being more specifically task oriented. (See figure
1.1.) In addition, the discussion will be limited to homogeneous
systems. These systems have the same hardware. In most cases the
discussion also applies to heterogeneous networks, in which the
hardware may be very different from installation to installation. It
is not the intent of this thesis to provide an implementation of a
distributed dictionary system. Rather, the thesis will deal with the
description of the data dictionary and the various ways the
dictionary may be distributed in the system.

The use and definition of a data dictionary will be proposed with
particular attention to the specific features and content. Next the
logical representation will be discussed, followed by a discussion of
five possible physical representations for the database. Four
proposed organizations will be presented, listing their advantages
and disadvantages. These four configurations have been simulated, and
the results will be presented and discussed. Last of all,
recommendations will be made and conclusions drawn.


Chapter 2

DATA DICTIONARY

The purposes of a data dictionary are to provide a means of control
on how the data are to be used, to provide a more complete form of
documentation, and to maintain certain characteristics about each
entity. There is not just one way to accomplish this. In fact,
certain of the attributes of the dictionary may be determined by the
specific system being used or the needs of the users. Generally
speaking, however, the data dictionary will contain entities such as
Data Files, Data Fields, Programs, Databases, Systems, Users,
Departments, Security Levels, and relationships established between
these entities [6,7,8].

Each of these entities will have associated with it the attributes or
characteristics that have been determined to be important. These will
be determined by the Database Administrator after conferring with the
users of the system. Lefkovits [9] suggests that the domain of the
data dictionary consists of three entity types. First, a data entity,
which includes data items (elements), groups (data aggregates), and
files. Second, a processing entity, which includes system modules,
programs, subprograms, and systems; and databases. Third, a usage
entity, such as users, departments, or organizations. Cullinane [10]
suggests that there are seven major categories. These include users
and systems, programs, elements (consisting of groups and items),
records, files, classes, and attributes. Users include projects,
departments, and individuals. Systems include subsystems and systems;
and programs include programs, subprograms, and modules. Classes deal
with security of the entity, language, mode, frequency, and privacy.
Attributes will be discussed at length below. Figure 2.1 shows how
these entities are related.

The attributes of a data item (element) might include:

1.  Official name--the name by which it is known to the dictionary.
2.  Synonym name--the name by which it is commonly referred to.
3.  Alias names--other names besides the synonym name by which it
    may be referred to.
4.  Description.
5.  Usage--how the item is used. It may be a numeric field,
    alphanumeric, or another form.
6.  Program language names--COBOL, Fortran, Assembly, etc.
7.  Version--a version number of the dictionary.
8.  Security--security level needed to access data instances.
9.  Classification--batch, on-line, etc.

Other attributes considered essential could also be included. A data
group might include other attributes such as Contents--names of
item(s) and/or group(s) that make up the group, and Position--the
alignment of the parts within the group. A record would include other
attributes such as Access and Primary Key. Files would specify
Structure--sequential, hashed, etc., and Sorting Order--ascending,
descending, or most frequently accessed. Programs would include
attributes like Source Language, Programmer--the person responsible
for the program, and Characteristics.

Figure 2.1 Relationships between Data Dictionary entities

Logical Representation

Once the attributes essential to this system have been determined,
user views of the data, or subschemas, must be drawn up. A subschema
describes how the data is to be viewed by the specific users of the
data dictionary system. The users for an integrated data dictionary
system consist of the following: 1. Terminal users making requests
on-line, 2. Batch users generating reports, and 3. Programs compiling
and assembling. Figure 2.2 represents this kind of activity. The
dictionary must be able to interact with all three user classes.
Figure 2.3 represents a simple user's view that might be supported by
the data dictionary.

Once all the user views of the data in the data dictionary have been
determined, a logical view of the entire system can be developed.
Appendix A represents all the user views that are supported in the
data dictionary in this thesis. Each individual subschema is
represented in the overall logical view of the data dictionary.


After the logical view of the data dictionary is developed, the
following entities and information are available in the dictionary:

1.  A dictionary name--this is the entity name determined by the
    database administrator. Entities consist of databases, data
    elements, data records, files, programs, and systems.
2.  A synonym name--the name the entity is commonly referred to by.
    It may be the same as the dictionary name.
3.  An English language description of the entity.
4.  How the item is used: numeric, alphanumeric, etc.
5.  The number of digits or characters used in representing data
    elements.
6.  A version number for the dictionary specifications of the entity
    concerned.
7.  The security level necessary to access instances of the entity.
8.  The name of the entity as it is used in the various programming
    languages, such as Fortran and COBOL.
9.  The date the entity was created.
10. The date the entity was last updated.
11. The classification of the entity, whether it is batch, on-line,
    test, or production.
12. How the records are accessed: sequential, random, etc.
13. The primary key used to access the records.
14. How the file is structured, e.g., indexed sequential, hashed, or
    binary tree.
15. The contents of each file, record, and aggregate, and how the
    contents are aligned or positioned in the entity.
16. The source language of the programs, e.g., BASIC or APL.
17. The name of the programmer responsible for the program.
18. The characteristics of the program, e.g., files used, output
    generated, and processing time required.
19. The names of the various users that use the dictionary.
20. The name of the database administrator and information about him
    that will help in contacting him.
21. Alias names--other names entities may be referred to by.


This information allows a user who is writing a new program to
ascertain whether specific items already exist and, if so, where they
are found. For example: list all the programs that use the data
aggregate home-address, printing source code, COBOL-name, and program
description. If the user is not sure of the correct name, then he may
ask for the dictionary name and supply an alias name. A department
may be considering changing the length of the inventory number from
six digits to ten digits. A report can be produced listing all the
programs that would be affected by this change. A system programmer
may need to contact all users using the Heap sort routine because it
is not sorting correctly. This information is contained in the
database and can be obtained on demand. The dictionary not only
serves as a tremendous source of information, but it also provides
control and standardization of data names across systems, programs,
and users.
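
The inquiries just described can be illustrated with a small in-memory
model. The following Python sketch is not the thesis implementation;
the class names, field names, and sample entries (BILLING, MAILER,
STOCK, and the aliases) are hypothetical and chosen only to show how
an alias lookup and an impact report might work.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Entry:
    name: str                                    # dictionary (official) name
    kind: str                                    # "element", "aggregate", "program", ...
    description: str = ""
    length: int = 0                              # digits or characters, for data elements
    aliases: Set[str] = field(default_factory=set)
    uses: Set[str] = field(default_factory=set)  # for programs: data entities referenced


class DataDictionary:
    def __init__(self) -> None:
        self.entries: Dict[str, Entry] = {}

    def add(self, entry: Entry) -> None:
        self.entries[entry.name] = entry

    def resolve(self, name_or_alias: str) -> str:
        """Return the dictionary name for a dictionary name or an alias."""
        if name_or_alias in self.entries:
            return name_or_alias
        for entry in self.entries.values():
            if name_or_alias in entry.aliases:
                return entry.name
        raise KeyError(name_or_alias)

    def programs_using(self, name_or_alias: str) -> List[str]:
        """List the programs that reference the given data entity."""
        target = self.resolve(name_or_alias)
        return sorted(e.name for e in self.entries.values()
                      if e.kind == "program" and target in e.uses)


# Hypothetical sample content.
dd = DataDictionary()
dd.add(Entry("home-address", "aggregate", "Mailing address of a person",
             aliases={"home-addr"}))
dd.add(Entry("inventory-number", "element", "Stock item identifier",
             length=6, aliases={"inv-no"}))
dd.add(Entry("BILLING", "program", uses={"home-address", "inventory-number"}))
dd.add(Entry("MAILER", "program", uses={"home-address"}))
dd.add(Entry("STOCK", "program", uses={"inventory-number"}))

# Programs affected if inventory-number grows from six to ten digits,
# looked up through an alias rather than the dictionary name.
affected = dd.programs_using("inv-no")
print(affected)
```

A real dictionary would answer such requests from indexed storage
rather than by scanning every entry; the scan is used here only to
keep the sketch short.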

The attributes and subschemas were organized and, following a process
described by Martin, were reduced to a canonical schema of the
dictionary database [11]. Table 2.1 summarizes the steps presented by
Martin to arrive at a canonical form, and figure 2.4 represents the
canonical schema for this data dictionary. A canonical schema is a
model of data which represents the inherent data structure but is
independent of any application. It is also independent of the
software or hardware which is employed in representing and using the
data [12].


1.  Take the first user's view of the data and draw it in the form
    of a bubble chart.
2.  Take the next user's view, represent it as above and merge it
    with the first user's view.
3.  From the resulting graph distinguish between attribute nodes and
    primary keys. Mark the primary keys.
4.  For each association between keys, add the inverse if it is not
    already present.
5.  Examine the associations and remove any genuinely redundant
    associations.
6.  Repeat steps 2-5 until all user views are merged into the graph.
7.  Identify the root key.
8.  Eliminate isolated attributes.
9.  Modify the graph to eliminate intersecting attributes.
10. Redraw the data items into groups (records, tuples).
11. Identify all secondary keys.
12. The unconstrained "canonical" schema may now be represented in a
    particular software package.
13. Return to the original user views and make sure they are
    represented correctly in the canonical schema.


Table 2.1 Canonical schema design procedure according to Martin [11]
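
Steps 1 through 4 of the procedure can be sketched in Python as
operations on a set of associations. This is an illustration only,
under stated assumptions: each association is written as a
(source, target, kind) triple, a node is treated as a primary key
when at least one "1" association leaves it, and an inverse that has
to be added is marked "?" because its kind must be decided by the
analyst. The item names and the two user views are hypothetical.

```python
from typing import Iterable, Set, Tuple

# (source, target, kind): kind "1" means each source value identifies one
# target value; "M" means it identifies many.
Association = Tuple[str, str, str]


def merge_views(views: Iterable[Iterable[Association]]) -> Set[Association]:
    """Steps 1-2: overlay the user views; duplicate associations collapse."""
    merged: Set[Association] = set()
    for view in views:
        merged.update(view)
    return merged


def primary_keys(graph: Set[Association]) -> Set[str]:
    """Step 3 (assumed rule): a key has at least one "1" association leaving it."""
    return {source for source, _, kind in graph if kind == "1"}


def add_missing_inverses(graph: Set[Association]) -> Set[Association]:
    """Step 4: for each association between keys, add the inverse if absent."""
    keys = primary_keys(graph)
    linked = {(source, target) for source, target, _ in graph}
    result = set(graph)
    for source, target, _ in graph:
        if source in keys and target in keys and (target, source) not in linked:
            result.add((target, source, "?"))  # kind left for the analyst
    return result


# Two hypothetical user views of the dictionary data.
programmer_view = [
    ("PROGRAM-NAME", "SOURCE-LANGUAGE", "1"),
    ("PROGRAM-NAME", "PROGRAMMER", "1"),
    ("PROGRAM-NAME", "ELEMENT-NAME", "M"),
]
analyst_view = [
    ("ELEMENT-NAME", "DESCRIPTION", "1"),
    ("ELEMENT-NAME", "LENGTH", "1"),
    ("PROGRAM-NAME", "SOURCE-LANGUAGE", "1"),
]

merged = merge_views([programmer_view, analyst_view])
keys = primary_keys(merged)
with_inverses = add_missing_inverses(merged)
print(sorted(keys))
```

Removing redundant associations (step 5) is deliberately left out,
since deciding whether an association is genuinely redundant depends
on its meaning and not only on the shape of the graph.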


Figure 2.4 Canonical Schema


The canonical schema, when it is implemented, can be represented in
CODASYL Data Base Task Group-based software, in IBM's DL/I-based
software, or as a relational database. When it is represented in an
appropriate database management system, the canonical schema provides
the best facility for future changes. Most future changes can be
accommodated by an incremental growth of the canonical schema without
massive restructuring [13]. In addition, the database administrator
of any system can represent this canonical schema in the database
management system of his choice. This applies equally to homogeneous
and heterogeneous networks, which demonstrates the flexibility of
this technique. The canonical schema of figure 2.4 is represented in
CODASYL, DL/I, and relational form in figures 2.5, 2.6, and 2.7
respectively.


For the purpose of simulation, this canonical schema was implemented
in Image, the DBMS for the HP 3000 version 2 at Brigham Young
University, as presented in Appendix B. The canonical schema is a
minimal structure and not all software packages are able to represent
this structure. Thus it may be necessary to deviate from the
canonical schema by introducing redundancy. Such is the case with
Image/3000.


Physical Representation

Regardless of the logical representation used, numerous physical
organizations exist. Care must be taken when determining the physical
organization, since this will greatly affect machine performance. The
selection of physical organization is largely determined by the need
for operational efficiency, fast response times, and cost
minimization.

Martin [14] suggests four steps to identify those areas of high usage
and fast-response paths. First, mark all paths in the canonical
schema which will be used on interactive systems and require fast
response time. Second, determine the number of times a user path will
be followed in a given amount of time. Third, estimate the length of
each group. Fourth, for each one-to-many association estimate how
many there will be. The results of this analysis may affect the
choice of structure or result in a modification to the schema.

For an integrated data dictionary, response time and cost are of
utmost importance and, therefore, may dictate random access over
sequential. In addition, an organization should be selected that
handles numerous insertions and deletions, and facilitates expansion
or growth. Other factors influencing the desired physical
organization are blocking, data compaction, frequency of reference,
clustered insertions, and multiple-key retrieval.

In order to handle the numerous user views in this database and
multiple-key retrieval, a physical organization tailored toward a
fully inverted database will be necessary. Five possible physical
organizations are presented here.

First is an inverted list organization. The indexes contain the
entire key as opposed to a key range, which results in larger indexes
but does not increase the number of entries in the indexes. It
increases the resolving power of the index and the number of
inquiries that can be answered.

Second is an inverted list organization with indirect addressing.
This organization is similar to the first except that none of the
secondary indexes have machine addresses. This means that when the
data files are reorganized and the physical addresses are changed,
only the primary index needs to be rewritten. Also, more questions
can be answered using the indexes instead of going to the data files
because the full key is used instead of a truncated key.

Third is a bucket-resolved inverted list organization. In an attempt
to reduce the size of the indexes, the indexes point to the buckets
that contain the values for that key. The size of the bucket directly
affects the index savings. A disadvantage is that fewer queries can
be answered in the indexes because keys may be abbreviated or
truncated.

Fourth is a bucket-resolved inverted list organization represented by
bit strings. The bit strings are used to represent the buckets the
contents are located in. Bit string representation is more compact
than other methods. The savings is also affected by the number and
size of the buckets. If the data records are sorted so that like
values for the indexes are in the same bucket, then the number of
buckets searched will be reduced. Sorting increases the complexity of
file maintenance and, therefore, is desirable only when files are not
updated on-line.
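
The bit string idea in the fourth organization can be sketched as
follows. This is a minimal Python illustration, not the thesis
design: Python integers stand in for the bit strings, bit i is set
when bucket i holds at least one record with the given key value,
and the bucket contents and field names are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]


def build_bit_index(buckets: List[List[Record]], field_name: str) -> Dict[str, int]:
    """Map each value of field_name to a bit string of the buckets containing it."""
    index: Dict[str, int] = {}
    for bucket_no, records in enumerate(buckets):
        for record in records:
            value = record[field_name]
            index[value] = index.get(value, 0) | (1 << bucket_no)
    return index


def buckets_to_search(*bit_strings: int) -> List[int]:
    """AND the bit strings; the surviving bits name the candidate buckets."""
    combined = bit_strings[0]
    for bits in bit_strings[1:]:
        combined &= bits
    return [i for i in range(combined.bit_length()) if (combined >> i) & 1]


# Three hypothetical buckets of program records.
buckets = [
    [{"name": "PAY01", "language": "COBOL", "system": "PAYROLL"},
     {"name": "PAY02", "language": "FORTRAN", "system": "PAYROLL"}],
    [{"name": "STU01", "language": "FORTRAN", "system": "STUDENT"}],
    [{"name": "PAY03", "language": "FORTRAN", "system": "PAYROLL"},
     {"name": "INV01", "language": "COBOL", "system": "INVENTORY"}],
]

by_language = build_bit_index(buckets, "language")
by_system = build_bit_index(buckets, "system")

# Multiple-key retrieval: Fortran programs in the payroll system.
candidates = buckets_to_search(by_language["FORTRAN"], by_system["PAYROLL"])
print(candidates)
```

The result names buckets, not records, so each candidate bucket must
still be scanned. A bucket can qualify because one record matches the
first key and a different record matches the second, which is why
grouping like values into the same bucket reduces the search.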

Fifth is an organization with secondary keys removed from the data
files into a chained index. In this method, the secondary keys are
removed from the data records and stored in the indexes. This
organization permits more queries to be answered without searching
the data records. Chains are contained in the indexes, which
facilitates quicker searching. Also, maintenance is simplified
because it is handled in the smaller index files. A disadvantage of
this method is that if the pointers are damaged it might result in a
loss of the data record associated with that secondary key. To
eliminate this potential problem the secondary keys could be carried
in the data records as well.
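
Of the five organizations, the second (inverted lists with indirect
addressing) lends itself to a compact illustration. In the Python
sketch below, which is an assumption-laden model and not the thesis
implementation, a list position stands in for a machine address, the
secondary indexes hold only primary keys, and the class and record
names are hypothetical. Reorganizing the data file rewrites the
primary index and nothing else.

```python
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]


class IndirectInvertedFile:
    def __init__(self) -> None:
        self.data_file: List[Tuple[str, Record]] = []   # position = "machine address"
        self.primary: Dict[str, int] = {}               # primary key -> address
        self.secondary: Dict[Tuple[str, str], Set[str]] = {}  # (field, value) -> primary keys

    def insert(self, key: str, record: Record) -> None:
        self.primary[key] = len(self.data_file)
        self.data_file.append((key, record))
        for field_name, value in record.items():
            self.secondary.setdefault((field_name, value), set()).add(key)

    def count(self, field_name: str, value: str) -> int:
        """Answered from the secondary index alone; no data file access."""
        return len(self.secondary.get((field_name, value), ()))

    def find(self, field_name: str, value: str) -> List[Record]:
        """Secondary index -> primary keys -> primary index -> data file."""
        keys = sorted(self.secondary.get((field_name, value), ()))
        return [self.data_file[self.primary[k]][1] for k in keys]

    def reorganize(self) -> None:
        """Re-sequence the data file by primary key; only the primary index changes."""
        self.data_file.sort(key=lambda pair: pair[0])
        self.primary = {key: addr for addr, (key, _) in enumerate(self.data_file)}


f = IndirectInvertedFile()
f.insert("PAY03", {"language": "FORTRAN", "system": "PAYROLL"})
f.insert("STU01", {"language": "FORTRAN", "system": "STUDENT"})
f.insert("PAY01", {"language": "COBOL", "system": "PAYROLL"})

before = f.find("system", "PAYROLL")
secondary_before = {k: set(v) for k, v in f.secondary.items()}
primary_before = dict(f.primary)

f.reorganize()
after = f.find("system", "PAYROLL")
print(f.count("language", "FORTRAN"), after == before)
```

The extra hop through the primary index is the price paid for leaving
the secondary indexes untouched when records move.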

Another determining factor that affects access time is how the
indexes are searched. A number of techniques are used. These include:
serial scan, block search, binary search, binary-tree search,
balanced tree index, unbalanced tree index, algorithm index, hash
index, look-aside buffer, and sequence set chain [15]. A hash index
was determined to be the best technique to support a data dictionary
database.

A hash index does not have the advantage of a pure hash because
records may not be found in one seek. However, it allows records to
be stored in some other order. Records could be stored sequentially
by primary key or with their parents in a tree structure. The empty
spaces in the buckets, typical of hashing techniques, are in the
smaller indexes and not in the data storage. It is less wasteful than
pure hashing, and handling overflows is less time consuming because
they are in the indexes.
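
The distinction between a hash index and pure hashing can be made
concrete with a short Python sketch. It is illustrative only: the
bucket count, the use of CRC-32 as the hash function, and the sample
records are assumptions, and a list position again stands in for a
storage address. The data file stays in primary key sequence while
the hashing, with its empty slots and overflows, is confined to the
index.

```python
import zlib
from typing import List, Optional, Tuple


class HashIndex:
    """Hashes keys into index buckets that hold (key, address) pairs."""

    def __init__(self, n_buckets: int = 8) -> None:
        self.buckets: List[List[Tuple[str, int]]] = [[] for _ in range(n_buckets)]

    def _bucket(self, key: str) -> List[Tuple[str, int]]:
        return self.buckets[zlib.crc32(key.encode("utf-8")) % len(self.buckets)]

    def put(self, key: str, address: int) -> None:
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[i] = (key, address)   # replace an existing entry
                return
        bucket.append((key, address))        # collisions simply extend the bucket

    def get(self, key: str) -> Optional[int]:
        for existing, address in self._bucket(key):
            if existing == key:
                return address
        return None


# Data file kept sequentially by primary key (hypothetical entries).
data_file = sorted([
    ("social-security", "element"),
    ("home-address", "aggregate"),
    ("inventory-number", "element"),
    ("payroll", "system"),
])

index = HashIndex()
for address, (key, _) in enumerate(data_file):
    index.put(key, address)

# Two steps per lookup: one probe of the index, then one data access.
address = index.get("payroll")
print(data_file[address])
```

A lookup costs an index probe plus a data access, so it is not the
single seek of a pure hash, but the data file carries no empty space.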


The data dictionary database will be substantial in size, and the
organization that provides efficient use of storage and speed in
access, and conforms best to the specific unit characteristics,
should be used.

The data dictionary described here, derived from the various user
views, contains important information about every data item,
aggregate, record, file, program, system, user, and database an
installation may want. It is designed to handle practically any
request imaginable. For example: List all the data elements used in
the payroll system; List all the programs written in Fortran and
containing the data aggregate Date; List all the alias names for the
data item social-security; List the contents of all records in the
student system and using Name as primary key; List all the programs
John Doe is responsible for, printing program name, source code,
characteristics, date created, date updated; etc.


Chapter 3


DISTRIBUTION

The data dictionary is meta-data about the database and, generally
speaking, will approach the size of a corporation database. Inasmuch
as the dictionary is a repository for an integrated database which is
geographically distributed, the proper distribution of the dictionary
for fast response is essential. Four different distributions are
presented. They are: 1. Central data dictionary; 2. Fully redundant
data dictionary; 3. Partitioned data dictionary; and 4. Partially
replicated data dictionary.

Central

The centralized system has only one complete data dictionary. (See
figure 3.1.) In order for any of the other installations in the
network to access the dictionary, they must obtain this information
through data communication lines. This means that there must be at
least one data communication line between the installation with the
dictionary and every other installation. Those installations which
require a lot of information from the dictionary may require more
than one data communication line.

Fully Redundant

The fully redundant system requires that each installation have a
complete data dictionary. (See figure 3.2.) No accessing is necessary
beyond the local installation. The data communication lines are
needed to ensure that the dictionaries are identical and reflect the
latest updates.

Partitioned

The partitioned data dictionary system is segmented into physically
disjoint sections. (See figure 3.3.) Each of these Local Dictionary
Directories (LDD) is geographically distributed in the network at the
installations that require this information. The separate sections,
because of their interrelationships, logically form a single
database. Any request for non-local data results in a remote message
broadcast throughout the network to retrieve the desired data.

Partially Replicated

In the partially replicated data dictionary, the data dictionary
consists of two parts: the Global Dictionary Directory (GDD) and the
Local Dictionary Directory (LDD). (See figure 3.4.) Each Global
Dictionary Directory contains a complete list of all the entries in
the dictionary, a pointer to the Local Dictionary Directory in which
the details are found, and the security level required to access
instances of the entity. The Local Dictionary Directory is the
partitioned dictionary database for that installation. A request for
non-local data will generate a remote message to the specific
installation where the data is stored.
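
The difference between the partitioned and the partially replicated
arrangements shows up in the number of remote messages a request
generates. The Python sketch below is a simplified model, not the
simulated system: the three site names, the dictionary entries, the
single security level, and the rule that a caller's clearance must
be at least the entry's level are all assumptions made for
illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical network: each site's LDD maps entry name -> details.
LDDS: Dict[str, Dict[str, str]] = {
    "SITE-A": {"home-address": "aggregate: street, city, state, zip"},
    "SITE-B": {"inventory-number": "element: 6 digits"},
    "SITE-C": {"payroll": "system: weekly payroll"},
}

# GDD, replicated at every site: entry name -> (site holding details, security level).
GDD: Dict[str, Tuple[str, int]] = {
    name: (site, 1) for site, ldd in LDDS.items() for name in ldd
}


def partitioned_lookup(local_site: str, name: str) -> Tuple[Optional[str], int]:
    """Return (details, remote messages sent) using broadcast on a local miss."""
    if name in LDDS[local_site]:
        return LDDS[local_site][name], 0
    found, messages = None, 0
    for site, ldd in LDDS.items():
        if site == local_site:
            continue
        messages += 1                 # the broadcast reaches every other site
        if name in ldd:
            found = ldd[name]
    return found, messages


def replicated_lookup(local_site: str, name: str,
                      clearance: int) -> Tuple[Optional[str], int]:
    """Return (details, remote messages sent) using the local GDD copy."""
    if name not in GDD:
        return None, 0                # the GDD is complete, so a miss needs no messages
    site, level = GDD[name]
    if clearance < level:             # assumed rule: clearance must reach the level
        raise PermissionError(name)
    if site == local_site:
        return LDDS[site][name], 0
    return LDDS[site][name], 1        # one directed message to the owning site


print(partitioned_lookup("SITE-A", "payroll"))
print(replicated_lookup("SITE-A", "payroll", clearance=1))
```

The GDD trades storage and update traffic at every site for a single
directed message per remote request, and it can also reject an
unknown name or an unauthorized request without using the network.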


Chapter 4


SIMULATION

In order to simulate these proposed configurations (see Appendix C),
it was necessary to answer a number of questions. For instance, how
much time is needed to perform a disk access, and how many disk
accesses are required to insert a complete entry into the data
dictionary, to delete an entry, and to modify one? In addition, it
was necessary to determine how much time was required to send an
inquiry to a remote installation.

Disk Access Time

The time to perform a disk access was determined from observing a
large number of I/O operations on the system. The time ranged from 40
msec to 80 msec. Fifty milliseconds was used in the simulations. To
determine the number of disk accesses for updates and retrievals, the
Image schema for the data dictionary database was analyzed for disk
access operations. The Image schema structure provides two types of
sets: master and detail. Figure 4.1 depicts a partial representation
of the canonical schema of the data dictionary in Image diagram form.


Figure 4.1 Partial Image schema diagram

A master file is basically an inverted file and may reference from
one to sixteen detail sets. A detail set may have from one to sixteen
search items. Image uses a hash function to store the entries in the
master file and then chains the pointers to the detail occurrences.
Data is frequently stored redundantly [16].


The simulated dictionary consisted of four data entities--elements,
aggregates, records, and files; three process entities--programs,
systems, and databases; and one user entity--individuals. A number of
inquiries from each of these entities were checked against the Image
schema, representing deletes, inserts, modifies, and retrievals.
These inquiries were to determine the average number of disk accesses
that could be expected in each case. The minimum number and maximum
number of disk accesses were also determined for each entity.

Using the amount specified in the capacity attribute of the Image
schema, a weighted average for the minimum and maximum disk accesses
for an insert was computed. Both the minimum and maximum values were
then multiplied by the 50 msec disk access time previously
determined, to provide a minimum and maximum access time for inserts.
This procedure was repeated for deletes, modifying updates, and
retrievals. Table 4.1 shows the results that were obtained.

INSERTS/             MIN DISK    MAX DISK    NUMBER OF
DELETES              ACCESS      ACCESS      ENTRIES

ELEMENT                  9          12         100000
AGGREGATE               10          13          50000
RECORD                  10          13          10000
FILE                    10          13          10000
PROGRAM                 10          13           1000
SYSTEM                   4           5            100
DATABASE                 6           6             10
USER                     5           6           1000
WEIGHTED AVERAGE         9          12

MODIFY
WEIGHTED AVERAGE         3           6

RETRIEVE
WEIGHTED AVERAGE         3           4

Table 4.1 Number of disk accesses to perform the desired operation
for the specific entity.
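
The weighted averages in Table 4.1 can be reproduced from its rows.
The Python sketch below assumes that the weights are the numbers of
entries and that the averages were rounded to whole disk accesses
before being multiplied by the 50 msec access time; the thesis does
not state the rounding, so that step is an inference from the table.

```python
DISK_ACCESS_MSEC = 50

# entity: (min disk accesses, max disk accesses, number of entries), from Table 4.1
INSERT_DELETE = {
    "ELEMENT":   (9, 12, 100000),
    "AGGREGATE": (10, 13, 50000),
    "RECORD":    (10, 13, 10000),
    "FILE":      (10, 13, 10000),
    "PROGRAM":   (10, 13, 1000),
    "SYSTEM":    (4, 5, 100),
    "DATABASE":  (6, 6, 10),
    "USER":      (5, 6, 1000),
}


def weighted_average(rows, column):
    """Average one column of disk accesses, weighted by number of entries."""
    total_entries = sum(row[2] for row in rows.values())
    weighted_sum = sum(row[column] * row[2] for row in rows.values())
    return weighted_sum / total_entries


min_avg = weighted_average(INSERT_DELETE, 0)
max_avg = weighted_average(INSERT_DELETE, 1)

# Assumed rounding to whole accesses, then conversion to milliseconds.
min_msec = round(min_avg) * DISK_ACCESS_MSEC
max_msec = round(max_avg) * DISK_ACCESS_MSEC

print(f"{min_avg:.2f} {max_avg:.2f} {min_msec} {max_msec}")
```

Because elements and aggregates account for most of the entries, they
dominate both averages; the small system, database, and user sets
have almost no effect on the result.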

Transmission Time

To determine the time to transmit an inquiry to a remote site, the
traffic and message delays of the Advanced Research Projects Agency
(ARPA) network were examined [17]. For a single full packet message
(1000 bits) to be sent between two close sites and be acknowledged
required 50 msec. Acknowledged means that a notification was returned
that the message sent was received. For a message to be sent
cross-country involving five locations required 190 msec. Each site
has a store-and-forward Interface Message Processor (IMP). The time
required to process a store-and-forward packet is about 0.35 msec. An
F-packet message sent between two close sites required 195 msec.
These results were obtained while the network was operating with a
light load.

Limitations 


Certain limitations were imposed to keep the simulation simple. The
network installations were all assumed to be up and processing
messages at all times. No attempt was made to simulate failed nodes.
However, the issues with down nodes in a network will be discussed
later.

The messages simulated were single full packets. Each installation
had one data communication line operating in a half-duplex mode.
This, of course, is not the ideal configuration in a real world
situation, but it allowed the simulations to proceed without
introducing a great deal of complexity. Ideally there may be one or
more data communication lines with 64 logical paths operating in a
half-duplex mode.

Response time was considered to be the time from entering the request
until an acceptable response was received. The response may be either
the desired data or a failure to find the data. In some cases one
request may trigger a search of the network for an answer. In this
case the response was not sent until a response was received from all
the sites that were queried. This may seem like a severe limitation,
but the system having the information will take longer to respond
and, therefore, the negative responses will be completed sooner.
Since each configuration followed the same limitations, no
discrepancies were introduced.

The  processing  time  for  an  entry  was  considered 
to  be  the  time  computed  necessary  to  perform  the  disk 
accesses  for  that  entity  activity,  Actual  processing 
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time*  aside  from  disk  accessing  was  considered  minimal 


and*  therefore*  was  not  included. 


The simulations were run on the DEC-10 system
using the General Purpose Simulation System (GPSS)
language. The ADVANCE block was used to simulate
communication/transmission delays and the processing of
a dictionary inquiry. The times used to simulate these
delays are given in Table 4.2. The time to enter
or delete an entry from the GDD was two disk accesses, or
100 msec. Randomness was added that varied this
100 msec time by up to 25 msec while maintaining the
100 msec average.
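The randomized delay described above can be sketched in a few lines. This is only an illustration, not the GPSS model used in the thesis: it assumes the variation is uniform between the minimum and maximum, and the names `DELAYS_MS` and `sample_delay` are hypothetical. The three rows shown are copied from Table 4.2.

```python
import random

# (min, average, max) in milliseconds, copied from Table 4.2.
DELAYS_MS = {
    "local_transmission": (3, 10, 17),
    "remote_transmission": (50, 120, 190),
    "gdd_enter_delete": (75, 100, 125),
}


def sample_delay(operation, rng):
    """Draw one delay, assumed uniform between the tabulated min and max."""
    low, _average, high = DELAYS_MS[operation]
    return rng.uniform(low, high)


rng = random.Random(1978)  # fixed seed so that runs are repeatable
draws = [sample_delay("gdd_enter_delete", rng) for _ in range(10_000)]
mean_delay = sum(draws) / len(draws)  # stays close to the 100 msec average
```

Every draw falls between 75 and 125 msec, and the long-run mean stays near 100 msec, which is the property the text asks for.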


                          MIN    AVE    MAX

LOCAL TRANSMISSION          3     10     17
REMOTE TRANSMISSION        50    120    190
RETRIEVAL                 150    175    200
INSERT/DELETE             450    525    600
MODIFY                    150    225    300
ENTER/DELETE FROM GDD      75    100    125
GDD LOOK-UP                90    100    110
LDD LOOK-UP               110    120    130

Table 4.2 Number of milliseconds to perform
the specific operation

In order to consider each configuration in its
best situation and worst situation, four parameters were
used. These parameters consisted of: 1. the time
between requests; 2. remote versus local; 3. update
versus retrieval; and 4. insert/delete versus modify.
Three time intervals were considered: 1. 500 msec; 2.
1000 msec; and 3. 5000 msec. Since there were seven
installations inputting messages during the same time
intervals, the average time between messages entering
the network was actually one-seventh of the time
interval. In other words, an average of seven messages
entered the network during each time interval.


Centralized  Data  Dictionary 


The central data dictionary was located at
installation four, and therefore local transmission was
considered practically instantaneous but was delayed an
average of 10 msec. Inputs to the dictionary would
therefore occur randomly from the seven different
locations. The dictionary activity, whether it was an
insert, delete, modify, or retrieval, was processed
serially. The response was generated following the
completion of the query. If a message was in the
process of being transmitted from that same site for
another request, the response would be queued up until
the line was free before proceeding. In this
configuration there was no need to include a local
versus remote parameter because there was only one
dictionary, and for all sites except number four, the
message would be remote.
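The effect of serial processing at a single dictionary can be illustrated with a small single-server queue. This is a rough sketch rather than the thesis model: it assumes exponentially distributed gaps between requests, a fixed service time, and no transmission delay or line contention. The function name `simulate_central` and its parameters are hypothetical.

```python
import random


def simulate_central(n_messages, interval_ms, service_ms, sites=7, seed=0):
    """Mean time from request to completion at one serial dictionary.

    Assumptions (not from the thesis): exponential gaps between requests,
    a fixed service time, and no transmission delay.
    """
    rng = random.Random(seed)
    # Seven sites each sending once per interval behave like one stream
    # with a mean gap of interval_ms / sites.
    mean_gap = interval_ms / sites
    clock = 0.0
    server_free_at = 0.0
    total_response = 0.0
    for _ in range(n_messages):
        clock += rng.expovariate(1.0 / mean_gap)
        start = max(clock, server_free_at)  # wait if the dictionary is busy
        server_free_at = start + service_ms
        total_response += server_free_at - clock
    return total_response / n_messages


# 175 msec is the average retrieval time in Table 4.2.
light = simulate_central(1000, 5000, 175)  # five-second interval
heavy = simulate_central(1000, 1000, 175)  # one-second interval
```

Under these assumptions the mean response at the one-second interval is larger than at the five-second interval, because requests queue behind one another at the single dictionary.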


Fully  Redundant  Data  Dictionary 

Since each installation had a complete
dictionary, all messages could be answered from the local
site. Random inputs were generated from each of the
seven locations with equivalent interarrival rates.
Each message would have a local transmission delay.
Then, if the request was to retrieve, the look-up would
be performed and the results transmitted back to the
originator. Again, if the data transmission lines were
in use, the seizing message would have to wait until the
lines were free. If the request was to update the
dictionary, whether to insert, delete, or modify an
existing entry, then while the update was in progress at
the local site, six remote messages would be sent out to
the other sites to perform the desired request. A
response would not be sent to the originator until all
remote and local messages had been completed. This
resulted in slow response times on certain occasions.
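Why such an update is slow can be seen from a small sketch: the originator waits for the slowest of seven copies. The sketch assumes that the six remote updates proceed in parallel, that delays are uniform between the Table 4.2 minimum and maximum, and that there is no line contention (the thesis model did queue messages on a half-duplex line). The function name `redundant_update_response` is hypothetical.

```python
import random

REMOTE_TX_MS = (50, 190)  # remote transmission range from Table 4.2
MODIFY_MS = (150, 300)    # modify range from Table 4.2


def redundant_update_response(rng, sites=7):
    """Response time of one modify when every site holds a full copy."""
    local_done = rng.uniform(*MODIFY_MS)
    remote_done = []
    for _ in range(sites - 1):
        out = rng.uniform(*REMOTE_TX_MS)   # request travels to the remote site
        work = rng.uniform(*MODIFY_MS)     # update is applied remotely
        back = rng.uniform(*REMOTE_TX_MS)  # completion notice returns
        remote_done.append(out + work + back)
    # The originator is answered only after the slowest copy finishes.
    return max([local_done] + remote_done)
```

Because the result is a maximum over six remote round trips, it can never be shorter than 250 msec (50 + 150 + 50) under these assumptions, even though a local modify alone may take as little as 150 msec.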
Partitioned Data Dictionary

In the partitioned data dictionary, messages
originated from seven locations randomly. Each location
averaged the same specified number of milliseconds
between messages, and each message incurred a
local transmission delay. In this configuration
messages could be for either the local dictionary or a
remote one. A local message could be either a retrieval
or an update. In either case the desired function would
be performed and a response generated to the originator
as soon as the line was free. If the message was for
non-local information, six remote requests would be
broadcast throughout the network. One additional
constraint was imposed in this network. Updates could
only be done against the local dictionary. No one could
update another dictionary. Therefore, all remote
messages were retrievals only. The remote site that had
the information was determined randomly and the
retrieval was then performed. The response to the
originator included the remote transmissions to and from
the other sites, plus the dictionary look-up and the
local transmission delays.

Partially Replicated Dictionary


In the partially replicated dictionary each of
the messages incurred a local transmission delay and was
also generated randomly. If the message was against the
local dictionary, then a further breakdown was
necessary. If it was a request for some information,
then the data was retrieved and a response generated.
For a modify update the modification was made and a
response sent. For an insert or delete, six remote
updates to the other global dictionary directories were
generated. A response was generated upon completion of
all the updates. If the message was for retrieving
remote information, the specific installation was known
and the data retrieved. All remote transmissions
incurred a time delay.
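The four configurations differ mainly in how many requests one message sends to other sites. The sketch below tabulates that count as the configurations are described in this chapter. It counts only outbound requests, not the replies, and it treats every non-local message in the partially replicated case as a retrieval, which is an assumption. The function name and the configuration labels are hypothetical.

```python
def outbound_remote_requests(config, operation, is_local, sites=7):
    """Requests sent to other sites for one message.

    For "central", is_local means the message originates at the
    installation that holds the dictionary.
    """
    others = sites - 1
    if config == "central":
        return 0 if is_local else 1
    if config == "fully_redundant":
        # Retrievals are answered locally; updates go to every other copy.
        return 0 if operation == "retrieve" else others
    if config == "partitioned":
        # Non-local data: the request is broadcast to all other sites.
        return 0 if is_local else others
    if config == "partially_replicated":
        if not is_local:
            return 1  # the GDD names the installation holding the data
        # Only inserts and deletes touch the other sites' GDDs.
        return others if operation in ("insert", "delete") else 0
    raise ValueError(f"unknown configuration: {config}")
```

For example, a non-local retrieval costs six requests in the partitioned case but only one in the partially replicated case, while a local insert costs none in the former and six in the latter.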


Results  Analyzed 

The results of the test simulations were very
revealing but in most cases expected. Graph 4.1 reports
the results of running 1000 messages, at five second
intervals, against the central data dictionary,
configuration A. As was expected, the response [...]

[...] that for every update resulting in an insert or delete
in configuration D, seven GDDs were modified. As the
updates increased in number the spread became more
distinct.


Switching to the other end of the spectrum, we
find a definite rise in response time for both
configurations. Graph 4.4 shows the results. While the
number of updates was small, configurations C and D were
not as good as A or B. This is because 80% of the
message traffic was for non-local data. This resulted
in a search of the LDD in configuration C, and a search
of the GDD in configuration D followed by remote
transmissions. As the number of updates increased, C
and D steadily moved away from A and B and showed an
improved response time. At this end of the scale D
showed a markedly better response time than C. Also
notice that when updating exceeded 50%, D was also
better than A or B. Similar results were obtained when
the four configurations were run with message intervals
of one second and one half second. The difference was
that as the interval became smaller the response time
increased.

Further Considerations

The response time for configuration D could have
been improved in two ways. First, if the look-up time
for the GDD was faster than the LDD look-up in
configuration C, which it should be, this would result
in an improvement of response time. Second, when the
GDD was referenced, if it pointed to the LDD where the
data were contained, then a second search of the GDD
would not be necessary. This would result in a small
increase in update time, but would tremendously reduce
retrieval time.


Graph 4.1 Central data dictionary response curve as the fraction
of updates and inserts/deletes increase


Graph 4.2 Fully redundant configuration response curve. The first fraction
represents updates and the second fraction represents inserts/deletes


Graph 4.3 Partitioned and partially replicated configurations response curve,
with 20% messages remote. The first fraction represents updates and the second
fraction represents inserts/deletes


Chapter 5

DISTRIBUTION PROBLEMS


There are a number of problems associated with
distributed systems. Each of the simulated
distributions displayed some of these problems. The
centralized data dictionary had the advantage of simple
updating: there is only one copy. Also, if any remote
site went down, nothing in the data dictionary became
inaccessible. Locking in the network to perform updates
was not a problem. However, inquiries were handled
serially and were very slow at peak periods. Moreover,
if the installation that provided the data dictionary
went down, the whole network was down.


Fully  Redundant 


In the fully redundant data dictionary system
access is local and very fast. If any installation in
the network goes down, nothing in the data dictionary
becomes inaccessible. On the other hand, any change to
the dictionary requires multiple updates, one for each
site. The integrity of the dictionary would be
extremely critical and hard to maintain. If a site
failed, provisions would have to be made to provide
adequate recovery to ensure absolute integrity.

Partitioned 

In the partitioned data dictionary any query
against local data would be almost instantaneous. Local
updates would not affect any other sites. If any site
went down, the impact would be minimal. The
dictionary integrity would be a simpler problem. Remote
traffic, that is, inquiries against non-local data, would
result in a slower response because the inquiry would
need to be broadcast throughout the network. This is
because the local installation has no knowledge of where
the remote data is stored. This increase in traffic on
the data communication lines would become ponderous and
would be far from ideal.

Partially  Replicated 

This distribution provides quick access for
local traffic and is much faster for remote traffic
because only one other location is searched. Insertions
and deletions are the only updates that affect the remote
GDDs, and these updates are short and simple. Modify
updates affect only the Local Dictionary Directory. As
with the partitioned data dictionary, the loss of any
site would only prevent remote queries against that site
from being processed. Like the fully redundant data
dictionary system, provisions would have to be made for
handling updates to failed sites. This last distribution
is similar to the system used by Computer Corporation of
America in their design of SDD-1 (System for Distributed
Databases) [18].

Other  Problems 

Two other problems in distributed systems also
need discussion. The first is survivability: the system
must continue to operate despite any site failures or
inaccessibility of one or more databases. Second,
reliability and integrity of the database must be
maintained. This includes restoring downed sites to the
current position by processing any updates that occurred
while the installation was down. To increase
survivability, each of the seven LDDs in both C and D
could overlap with one or more of the other sites. This
would allow more remote requests to be answered at the
local site instead of being transmitted to another site.
It also means less data will be inaccessible because of
a node failure. This, however, involves redundant data,
which introduces a new problem: redundant updating. The
easiest method of handling this problem is to lock that
portion of the databases involved in the update. In a
distributed system, this would be intolerable.

Thomas [19] proposed a method which reduces the
volume of inter-module communications for locking by
using a voting protocol. Only a majority of the sites
are communicated with. Alsberg and Day [20] suggest
working with only one site, the "primary site." All
update activity is handled through a single primary
site.


Computer Corporation of America [21] sought to
avoid all locking in SDD-1. They determined that it is
not necessary for all copies to be instantaneously
identical at all times. It is sufficient that the same
final state be achieved if all update activity were to
cease. The update of redundant copies would be
accomplished by a specially designed protocol. All
messages were assigned a particular class by the DBA.
In addition, each message was assigned a timestamp
followed by a two digit site number. Once a clock had
been read it could not be read again until it had
advanced one unit of time. This, along with the two
digit site number, gave a globally unique timestamp.
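The timestamp rule just described can be sketched as follows. This is an illustration of the idea only, not the SDD-1 implementation: the class name `SiteClock` is hypothetical, and `read_time` is a hypothetical callable that returns an integer clock reading.

```python
class SiteClock:
    """Issue timestamps that are unique across sites."""

    def __init__(self, site_number, read_time):
        if not 0 <= site_number <= 99:
            raise ValueError("site number must fit in two digits")
        self.site_number = site_number
        self.read_time = read_time  # hypothetical clock source
        self.last_reading = None

    def next_timestamp(self):
        reading = self.read_time()
        # Never reuse a reading: wait until the clock has advanced.
        while self.last_reading is not None and reading <= self.last_reading:
            reading = self.read_time()
        self.last_reading = reading
        # Append the two-digit site number as the low-order digits.
        return reading * 100 + self.site_number
```

Two timestamps from the same site differ because the clock reading must advance between them, and two timestamps from different sites differ in their last two digits even when the clock readings are equal.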


Messages are assigned a class depending on their
logical read-set and write-set. Messages of the same
class process in timestamp order. Messages of different
classes are analyzed by a process described by Bernstein
[22] to determine the amount of synchronization that is
required. This approach guarantees two things. First,
after a finite period of time, the same logical data item
will retrieve the same value. This means that all
physical copies of a logical data item will converge to
the same value. Second, the interleaved operation
of messages is equivalent to an operation that runs
serially.

The problem of bringing failed sites up to the
current position could be solved by having the receiving
installation, as acknowledgement, send a message to the
originating installation once it has received the
message. In the case of the ARPA network this
acknowledgement was a RFNM (Request For Next Message),
which took 0.35 msec to generate. If this
acknowledgement is never received, then the sending
installation would re-issue the message.
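The acknowledge-and-re-issue rule can be sketched as a short loop. The callables `send` and `wait_for_ack` are hypothetical stand-ins for the network layer, and the `max_attempts` limit is an assumption added so that the loop ends; the text above does not state a limit.

```python
def send_until_acknowledged(send, wait_for_ack, message, max_attempts=5):
    """Re-issue `message` until the receiver acknowledges it.

    `wait_for_ack` returns True if an acknowledgement arrived before
    its timeout expired.
    """
    for attempt in range(1, max_attempts + 1):
        send(message)
        if wait_for_ack(message):
            return attempt  # number of transmissions that were needed
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")
```

A site that is down simply never acknowledges, so the sender keeps the update and re-issues it. A real system would also need to hold the update until the site returns rather than give up after a fixed number of attempts.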
Recommendation

It is not possible to know in advance exactly
what kind of activity will be present on a system or
network. Possible activity can be estimated. However,
care must be taken to prevent being locked into a
configuration that is not flexible. The partitioned
data dictionary and the partially replicated data
dictionary represent the best response times over a
range of message traffic. The partitioned data
dictionary, however, clogs up the communication lines
rather quickly when requests are made for non-local
data.

Regardless of the fact that these simulated
configurations were rather simple, a close examination
of the characteristics and behavior of each provides
firm evidence that a partially replicated data
dictionary system is the most efficient and flexible for
use in a distributed environment. The most efficient
physical organization for this dictionary system is an
organization with secondary keys removed from the data
files into a chained index. The indexes would be
searched by a hash index technique.
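One way to picture a chained index searched by hashing is sketched below. It is a minimal illustration under assumptions, not the physical design of the thesis: the class name `ChainedIndex`, the bucket count, and the use of integer record addresses are all hypothetical.

```python
class ChainedIndex:
    """Secondary-key index kept apart from the data file.

    A key is hashed to a bucket; each bucket holds a chain of
    (key, record_address) pairs.
    """

    def __init__(self, buckets=64):
        self.buckets = [[] for _ in range(buckets)]

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key, record_address):
        self._chain(key).append((key, record_address))

    def find(self, key):
        """Return the addresses of every record filed under `key`."""
        return [address for k, address in self._chain(key) if k == key]


index = ChainedIndex()
index.add("PAYROLL", 12)    # hypothetical system name and record address
index.add("PAYROLL", 40)
index.add("INVENTORY", 7)
payroll_records = index.find("PAYROLL")  # [12, 40]
```

A look-up touches only one chain instead of scanning the data file, which is the saving that removing secondary keys into an index is meant to provide.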
Summary

This thesis has explained the development of a
data dictionary from the initial step of determining
what should be included in a data dictionary.

The data dictionary must include entities such
as data items, groups, records, files, systems,
programs, and users. Each of these entities will have
descriptive attributes documenting their specific
features. These will include: how the entity is used;
its name, which will include alias names, synonym name,
and program names; a complete description of the entity;
and the security level necessary to access the
instances. How these entities relate to each other is
an essential part of this dictionary database. This
information allows concerned users and management to
answer questions like: which programs use the A/R
master file, who is authorized to update the customer
record file, and how many data elements comprise the
inventory record. The data dictionary provides answers
concerning existing programs and systems, as well as
information about the effects of future changes.
1. Element-Name  Synonym-Name  Description  Usage  Length  Version
2. Element-Name  COBOL-Name  Fortran-Name  Assembly-Name
3. Element-Name  Date-Created  Date-Updated  Version
4. Element-Name  Security  Classification  Database-Name
5. Element-Name  Version  System  Program
6. Element-Name  Synonym-Name  Alias-Name

Views of Elements


1. Aggregate-Name  Synonym-Name  Description  Version
2. Aggregate-Name  COBOL-Name  Fortran-Name  Assembly-Name
3. Aggregate-Name  Date-Created  Date-Updated  Version
4. Aggregate-Name  Synonym-Name  Contents  Position  Alias-Name
5. Aggregate-Name  Security  Classification  Database-Name
6. Aggregate-Name  Version  System  Program

Views of Aggregates


Program

Views of Programs


BEGIN DATA BASE STORE;

PASSWORDS:                      <<COMMENTS>>
  1 DBA;                        <<DATA BASE ADMINISTRATOR HAS READ>>
                                <<AND WRITE PERMISSION ON ALL ITEMS>>

ITEMS:                          <<COMMENTS>>
  ACCESS,           X10 (1/1);
  ALIAS-NAMES,      X32 (1/1);
  ASSEMBLY-NAME,    X32 (1/1);
  CHARACTERISTICS,  X40 (1/1);
  CLASSIFICATION,   X10 (1/1);
  COBOL-NAME,       X32 (1/1);
  CONTENTS,         X32 (1/1);
  DATE-CREATED,     X6  (1/1);  <<mmddyy>>
  DATE-UPDATED,     X6  (1/1);  <<mmddyy>>
  DBA,              X24 (1/1);
  DBA-DETAILS,      X40 (1/1);
  DESCRIPTION,      X72 (1/1);
  FORTRAN-NAME,     X6  (1/1);
  LENGTH,           J3  (1/1);
  PGM-CALLED,       X32 (1/1);
  AGGREGATE-NAME,   X32 (1/1);
  DATABASE-NAME,    X32 (1/1);
  ELEMENT-NAME,     X32 (1/1);
  FILE-NAME,        X32 (1/1);
  PRIMARY-KEY,      X32 (1/1);
  PROGRAM,          X32 (1/1);
  RECORD-NAME,      X32 (1/1);
  SYSTEM,           X32 (1/1);
  USER,             X32 (1/1);
  POSITION,         J2  (1/1);
  PROGRAMMER,       X32 (1/1);
  SECURITY,         X12 (1/1);
  SORT-ORDER,       X8  (1/1);
  SOURCE,           X8  (1/1);
  STRUCTURE,        X12 (1/1);
  SYNONYM-NAME,     X32 (1/1);
  USAGE,            X6  (1/1);
  VERSION,          X6  (1/1);

SETS:

NAME:     MSTR-SYNONYM, AUTOMATIC (1/1);
ENTRY:    SYNONYM-NAME(8);      <<NUMBER OF DETAIL SETS>>
CAPACITY: 1000;

NAME:     MSTR-DATABASE, MANUAL (1/1);
ENTRY:    DATABASE-NAME(10);
CAPACITY: 10;

NAME:     MSTR-ELEMENT, MANUAL (1/1);
ENTRY:    ELEMENT-NAME(4);
CAPACITY: 100000;

NAME:     MSTR-AGGREGATE, MANUAL (1/1);
ENTRY:    AGGREGATE-NAME(5);
CAPACITY: 10000;

NAME:     MSTR-RECORD, MANUAL (1/1);
ENTRY:    RECORD-NAME(5);
CAPACITY: 10000;

NAME:     MSTR-FILE, MANUAL (1/1);
ENTRY:    FILE-NAME(5);
CAPACITY: 10000;

NAME:     MSTR-PROGRAM, MANUAL (1/1);
ENTRY:    PROGRAM(8);
CAPACITY: 10000;

NAME:     MSTR-SOURCE, AUTOMATIC (1/1);
ENTRY:    SOURCE(1);
CAPACITY: 1000;

NAME:     MSTR-PROGRAMMER, AUTOMATIC (1/1);
ENTRY:    PROGRAMMER(1);        <<SECONDARY INDEX>>
CAPACITY: 1000;

NAME:     MSTR-SYSTEM, MANUAL (1/1);
ENTRY:    SYSTEM(8);
CAPACITY: 100;

NAME:     MSTR-USER, MANUAL (1/1);
ENTRY:    USER(7);
CAPACITY: 1000;

NAME:     MSTR-ALIAS, AUTOMATIC (1/1);
ENTRY:    ALIAS-NAME(5);
CAPACITY: 200000;


NAME:     ELEMENT-DETAIL, DETAIL (1/1);
ENTRY:    ELEMENT-NAME(MSTR-ELEMENT),
          SYNONYM-NAME(MSTR-SYNONYM),
          DESCRIPTION,
          USAGE,
          LENGTH,
          VERSION,
          COBOL-NAME,
          FORTRAN-NAME,
          ASSEMBLER-NAME,
          DATE-CREATED,
          DATE-UPDATED,
          SECURITY,
          CLASSIFICATION,
          DATABASE(MSTR-DATABASE);
CAPACITY: 100000;

NAME:     AGGREGATE-DETAIL, DETAIL (1/1);
ENTRY:    AGGREGATE-NAME(MSTR-AGGREGATE),
          SYNONYM-NAME(MSTR-SYNONYM),
          DESCRIPTION,
          VERSION,
          COBOL-NAME,
          FORTRAN-NAME,
          ASSEMBLER-NAME,
          DATE-CREATED,
          DATE-UPDATED,
          SECURITY,
          CLASSIFICATION,
          DATABASE-NAME(MSTR-DATABASE);
CAPACITY: 10000;

NAME:     RECORD-DETAIL, DETAIL (1/1);
ENTRY:    RECORD-NAME(MSTR-RECORD),
          SYNONYM-NAME(MSTR-SYNONYM),
          DESCRIPTION,
          VERSION,
          ACCESS,
          PRIMARY-KEY,
          COBOL-NAME,
          FORTRAN-NAME,
          ASSEMBLER-NAME,
          DATE-CREATED,
          DATE-UPDATED,
          SECURITY,
          CLASSIFICATION,
          DATABASE-NAME(MSTR-DATABASE);
CAPACITY: 10000;

NAME:     FILE-DETAIL, DETAIL (1/1);
ENTRY:    FILE-NAME(MSTR-FILE),
          SYNONYM-NAME(MSTR-SYNONYM),
          DESCRIPTION,
          DATE-CREATED,
          DATE-UPDATED,
          VERSION,
          STRUCTURE,
          SORT-ORDER,
          SECURITY,
          CLASSIFICATION,
          DATABASE-NAME(MSTR-DATABASE);
CAPACITY: 10000;


NAME:     PROGRAM-DETAIL, DETAIL (1/1);
ENTRY:    PROGRAM(MSTR-PROGRAM),
          SYNONYM-NAME(MSTR-SYNONYM),
          DESCRIPTION,
          DATE-CREATED,
          DATE-UPDATED,
          VERSION,
          SECURITY,
          CLASSIFICATION,
          DATABASE-NAME(MSTR-DATABASE),
          SOURCE(MSTR-SOURCE),
          PROGRAMMER(MSTR-PROGRAMMER),
          CHARACTERISTICS;
CAPACITY: 10000;

NAME:     SYSTEM-DETAIL, DETAIL (1/1);
ENTRY:    SYSTEM(MSTR-SYSTEM),
          DATABASE-NAME(MSTR-DATABASE),
          SYNONYM-NAME(MSTR-SYNONYM),
          DESCRIPTION,
          SECURITY,
          CLASSIFICATION;
CAPACITY: 100;

NAME:     DATABASE-DETAIL, DETAIL (1/1);
ENTRY:    DATABASE-NAME(MSTR-DATABASE),
          SYNONYM-NAME(MSTR-SYNONYM),
          DBA,
          DBA-DETAILS;
CAPACITY: 10;

NAME:     USER-DETAIL, DETAIL (1/1);
ENTRY:    USER-NAME(MSTR-USER),
          SYNONYM-NAME(MSTR-SYNONYM);
CAPACITY: 1000;

NAME:     ELEMENT-USER, DETAIL (1/1);
ENTRY:    ELEMENT-NAME(MSTR-ELEMENT),
          USER(MSTR-USER);
CAPACITY: 300000;

NAME:     ELEMENT-SYSTEM, DETAIL (1/1);
ENTRY:    ELEMENT-NAME(MSTR-ELEMENT),
          SYSTEM(MSTR-SYSTEM),
          PROGRAM(MSTR-PROGRAM);
CAPACITY: 300000;


NAME:     ELEMENT-ALIAS, DETAIL (1/1);
ENTRY:    ELEMENT-NAME(MSTR-ELEMENT),
          ALIAS-NAME(MSTR-ALIAS);
CAPACITY: 300000;

NAME:     AGGREGATE-USER, DETAIL (1/1);
ENTRY:    AGGREGATE-NAME(MSTR-AGGREGATE),
          USER(MSTR-USER);
CAPACITY: 30000;

NAME:     AGGREGATE-CONTENTS, DETAIL (1/1);
ENTRY:    AGGREGATE-NAME(MSTR-AGGREGATE),
          CONTENTS,
          POSITION;
CAPACITY: 50000;

NAME:     AGGREGATE-SYSTEM, DETAIL (1/1);
ENTRY:    AGGREGATE-NAME(MSTR-AGGREGATE),
          SYSTEM(MSTR-SYSTEM),
          PROGRAM(MSTR-PROGRAM);
CAPACITY: 50000;

NAME:     AGGREGATE-ALIAS, DETAIL (1/1);
ENTRY:    AGGREGATE-NAME(MSTR-AGGREGATE),
          ALIAS-NAME(MSTR-ALIAS);
CAPACITY: 30000;

NAME:     RECORD-CONTENTS, DETAIL (1/1);
ENTRY:    RECORD-NAME(MSTR-RECORD),
          CONTENTS,
          POSITION;
CAPACITY: 40000;

NAME:     RECORD-USER, DETAIL (1/1);
ENTRY:    RECORD-NAME(MSTR-RECORD),
          USER-NAME(MSTR-USER);
CAPACITY: 20000;

NAME:     RECORD-SYSTEM, DETAIL (1/1);
ENTRY:    RECORD-NAME(MSTR-RECORD),
          SYSTEM-NAME(MSTR-SYSTEM),
          PROGRAM(MSTR-PROGRAM);
CAPACITY: 20000;

NAME:     RECORD-ALIAS, DETAIL (1/1);
ENTRY:    RECORD-NAME(MSTR-RECORD),
          ALIAS-NAME(MSTR-ALIAS);
CAPACITY: 30000;


NAME:     FILE-CONTENTS, DETAIL (1/1);
ENTRY:    FILE(MSTR-FILE),
          CONTENTS;
CAPACITY: 30000;

NAME:     FILE-USER, DETAIL (1/1);
ENTRY:    FILE(MSTR-FILE),
          USER(MSTR-USER);
CAPACITY: 20000;

NAME:     FILE-SYSTEM, DETAIL (1/1);
ENTRY:    FILE(MSTR-FILE),
          SYSTEM(MSTR-SYSTEM),
          PROGRAM(MSTR-PROGRAM);
CAPACITY: 30000;

NAME:     FILE-ALIAS, DETAIL (1/1);
ENTRY:    FILE(MSTR-FILE),
          ALIAS-NAME(MSTR-ALIAS);
CAPACITY: 20000;

NAME:     PROGRAM-ALIAS, DETAIL (1/1);
ENTRY:    PROGRAM(MSTR-PROGRAM),
          ALIAS-NAME(MSTR-ALIAS);
CAPACITY: 20000;

NAME:     PROGRAM-SYSTEM, DETAIL (1/1);
ENTRY:    PROGRAM(MSTR-PROGRAM),
          SYSTEM(MSTR-SYSTEM);
CAPACITY: 30000;

NAME:     PROGRAM-USER, DETAIL (1/1);
ENTRY:    PROGRAM(MSTR-PROGRAM),
          USER(MSTR-USER);
CAPACITY: 40000;

NAME:     PROGRAM-PROGRAM, DETAIL (1/1);
ENTRY:    PROGRAM(MSTR-PROGRAM),
          PGMS-CALLED;
CAPACITY: 60000;

NAME:     DATABASE-SYSTEM, DETAIL (1/1);
ENTRY:    DATABASE-NAME(MSTR-DATABASE),
          SYSTEM(MSTR-SYSTEM);
CAPACITY: 40;

END.




DEVELOPMENT OF A DATA DICTIONARY FOR USE IN

ABSTRACT

A data dictionary was developed that can be used
on an integrated database that is geographically
distributed. The content, logical representation, and
physical representation of the data dictionary were
presented. A study of the best way to distribute this
dictionary was made by simulating several proposed
distributions. Thousands of messages were run against
each simulated distribution. The best distribution was
found after determining the average response time of a
message.


